
Ship 8 Tranche 7c: high-bit 4:4:4 RGBA u16 SIMD + sinker integration#31

Merged
uqio merged 2 commits into main from feat/ship8-rgba-high-bit-444-u16-simd
Apr 27, 2026
Conversation

@uqio uqio commented Apr 27, 2026

Summary

Closes Ship 8 high-bit 4:4:4 RGBA. Wires u16 RGBA SIMD across all 5 backends (NEON, SSE4.1, AVX2, AVX-512, wasm simd128), wires the 8 u16 RGBA dispatchers in src/row/mod.rs, and lands sinker-level integration: with_rgba (u8) + with_rgba_u16 (u16) builders for 10 sinker formats with Strategy A combine paths. Mirrors PR #26 (Tranche 5b) exactly, which did the same for 4:2:0.

After this lands, every YUV format in the inventory has packed RGBA output via MixedSinker<F>::with_rgba / with_rgba_u16 — closing the sink-side RGBA gap that motivated Ship 8.

Changes

SIMD u16 RGBA (5 backends × 4 kernel families = 20 kernel refactors)

Each backend's existing u16 RGB kernel becomes a thin wrapper over a const-ALPHA template, alongside a new RGBA u16 wrapper:

| Family | Const-ALPHA template | RGB wrapper | RGBA wrapper |
|---|---|---|---|
| Yuv444p_n u16 (BITS-generic) | yuv_444p_n_to_rgb_or_rgba_u16_row<BITS, ALPHA> | yuv_444p_n_to_rgb_u16_row<BITS> | yuv_444p_n_to_rgba_u16_row<BITS> |
| Yuv444p16 u16 (16-bit dedicated) | yuv_444p16_to_rgb_or_rgba_u16_row<ALPHA> | yuv_444p16_to_rgb_u16_row | yuv_444p16_to_rgba_u16_row |
| P_n_444 u16 (BITS-generic) | p_n_444_to_rgb_or_rgba_u16_row<BITS, ALPHA> | p_n_444_to_rgb_u16_row<BITS> | p_n_444_to_rgba_u16_row<BITS> |
| P_n_444_16 u16 (P416) | p_n_444_16_to_rgb_or_rgba_u16_row<ALPHA> | p_n_444_16_to_rgb_u16_row | p_n_444_16_to_rgba_u16_row |

Only the per-iteration store and scalar tail dispatch branch on ALPHA; per-pixel math is unchanged. Alpha contracts:

  • BITS-generic kernels: alpha = (1 << BITS) - 1 (low-bit-packed at native depth)
  • 16-bit dedicated kernels: alpha = 0xFFFF
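As a rough illustration of the const-ALPHA pattern and the alpha contracts above, here is a minimal scalar sketch. It is not the crate's code: the function names, the argument layout, and the `convert_pixel` stand-in (which only clamps instead of doing real YUV to RGB math) are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the real per-pixel YUV -> RGB math, which this
/// sketch does not reproduce: it only clamps each sample to the native range.
fn convert_pixel<const BITS: u32>(y: u16, u: u16, v: u16) -> [u16; 3] {
    let max = ((1u32 << BITS) - 1) as u16;
    [y.min(max), u.min(max), v.min(max)]
}

/// Shared template: ALPHA only changes the output stride and the store.
fn to_rgb_or_rgba_u16_row<const BITS: u32, const ALPHA: bool>(
    y: &[u16],
    u: &[u16],
    v: &[u16],
    out: &mut [u16],
) {
    let bpp = if ALPHA { 4 } else { 3 };
    // Opaque alpha at native depth: (1 << BITS) - 1, i.e. 0xFFFF for BITS = 16.
    let alpha = ((1u32 << BITS) - 1) as u16;
    assert!(u.len() >= y.len() && v.len() >= y.len());
    assert!(out.len() >= y.len() * bpp);
    for (i, px) in out.chunks_exact_mut(bpp).take(y.len()).enumerate() {
        px[..3].copy_from_slice(&convert_pixel::<BITS>(y[i], u[i], v[i]));
        if ALPHA {
            px[3] = alpha;
        }
    }
}

/// Thin RGB wrapper over the template.
pub fn to_rgb_u16_row<const BITS: u32>(y: &[u16], u: &[u16], v: &[u16], out: &mut [u16]) {
    to_rgb_or_rgba_u16_row::<BITS, false>(y, u, v, out)
}

/// Thin RGBA wrapper over the same template.
pub fn to_rgba_u16_row<const BITS: u32>(y: &[u16], u: &[u16], v: &[u16], out: &mut [u16]) {
    to_rgb_or_rgba_u16_row::<BITS, true>(y, u, v, out)
}
```

Because ALPHA is a const generic, each wrapper monomorphizes to a kernel with the branch resolved at compile time, which is presumably why the refactor leaves the per-pixel math untouched.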

Per-arch alpha splat: vdupq_n_u16(out_max as u16) (NEON) / _mm_set1_epi16(out_max) (x86, with -1i16 for 16-bit) / u16x8_splat(out_max as u16) (wasm). RGBA u16 store helpers (vst4q_u16, write_rgba_u16_8, write_rgba_u16_32, write_quarter_rgba) reused verbatim from PR #26's 4:2:0 work — no new helpers needed.
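The `-1i16` note for x86 follows from `_mm_set1_epi16` taking an `i16`: 0xFFFF does not fit as a positive `i16`, but `-1i16` has the same bit pattern. A small sketch of that splat, assuming hypothetical helper names (`alpha_as_i16`, `splat_alpha`) and falling back to a plain array off x86_64:

```rust
/// Alpha for a BITS-deep output, expressed as the i16 that `_mm_set1_epi16`
/// expects. For BITS = 16 the cast yields -1i16, whose bit pattern is 0xFFFF.
fn alpha_as_i16<const BITS: u32>() -> i16 {
    ((1u32 << BITS) - 1) as u16 as i16
}

#[cfg(target_arch = "x86_64")]
fn splat_alpha<const BITS: u32>() -> [u16; 8] {
    use core::arch::x86_64::{__m128i, _mm_set1_epi16, _mm_storeu_si128};
    let mut lanes = [0u16; 8];
    // SAFETY: SSE2 is part of the x86_64 baseline, and `lanes` is exactly
    // 16 bytes, so the unaligned 128-bit store stays in bounds.
    unsafe {
        let v = _mm_set1_epi16(alpha_as_i16::<BITS>());
        _mm_storeu_si128(lanes.as_mut_ptr() as *mut __m128i, v);
    }
    lanes
}

/// Portable fallback so the sketch also builds on other targets.
#[cfg(not(target_arch = "x86_64"))]
fn splat_alpha<const BITS: u32>() -> [u16; 8] {
    [alpha_as_i16::<BITS>() as u16; 8]
}
```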

Dispatcher wiring (8 u16 RGBA dispatchers in src/row/mod.rs)

Replace the 8 `let _ = use_simd; // SIMD per-arch routes land in Ship 8 Tranche 7c.` stubs (landed in PR #29) with the standard cfg_select! per-arch route block:

  • yuv444p9_to_rgba_u16_row, yuv444p10_to_rgba_u16_row, yuv444p12_to_rgba_u16_row, yuv444p14_to_rgba_u16_row (BITS-generic planar)
  • yuv444p16_to_rgba_u16_row (16-bit dedicated planar)
  • p410_to_rgba_u16_row, p412_to_rgba_u16_row (BITS-generic Pn)
  • p416_to_rgba_u16_row (16-bit dedicated Pn)

use_simd = false still forces scalar. Section header doc updated to reflect u16 RGBA is now SIMD-wired.
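For readers unfamiliar with this dispatcher shape, the following is a simplified sketch of a `use_simd`-gated route with scalar fallback. It deliberately uses plain `#[cfg]` plus `is_x86_feature_detected!` rather than the crate's cfg_select! block, covers only one backend, and uses hypothetical names with a placeholder kernel body.

```rust
/// Scalar reference path: clamps each sample to the 10-bit maximum.
fn clamp10_row_scalar(src: &[u16], dst: &mut [u16]) {
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        *d = (*s).min(1023);
    }
}

/// Placeholder for a real AVX2 kernel. In this sketch it just forwards to
/// the scalar path so the example stays short.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn clamp10_row_avx2(src: &[u16], dst: &mut [u16]) {
    clamp10_row_scalar(src, dst)
}

/// Dispatcher: `use_simd = false` always takes the scalar path.
pub fn clamp10_row(src: &[u16], dst: &mut [u16], use_simd: bool) {
    if use_simd {
        #[cfg(target_arch = "x86_64")]
        {
            if std::is_x86_feature_detected!("avx2") {
                // SAFETY: AVX2 availability was checked at runtime just above.
                unsafe { clamp10_row_avx2(src, dst) };
                return;
            }
        }
    }
    clamp10_row_scalar(src, dst);
}
```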

Sinker integration (10 formats × 4 builders + Strategy A combine)

src/sinker/mixed/subsampled_4_4_4_high_bit.rs (8 formats): Yuv444p9/10/12/14/16, P410/P412/P416 each gain with_rgba / set_rgba / with_rgba_u16 / set_rgba_u16. Each format's process() is restructured to consume the new buffers via Strategy A:

  • u16 path: rgba_u16-only routes through *_to_rgba_u16_row; rgb_u16 + rgba_u16 runs the RGB kernel once and fans out via expand_rgb_u16_to_rgba_u16_row::<BITS>.
  • u8 path: same shape — rgba-only goes direct; rgb + rgba (or hsv + rgba) uses scratch + expand_rgb_to_rgba_row fan-out.
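The Strategy A combine described above can be sketched roughly as follows for the u16 path. Only the name `expand_rgb_u16_to_rgba_u16_row` comes from this PR; its signature, the `rgb_kernel` / `rgba_kernel` stand-ins (which take already-converted pixels instead of YUV planes), and `process_row` are assumptions made for illustration.

```rust
/// Assumed signature: copies `width` RGB triples and appends opaque alpha.
fn expand_rgb_u16_to_rgba_u16_row<const BITS: u32>(rgb: &[u16], rgba: &mut [u16], width: usize) {
    let alpha = ((1u32 << BITS) - 1) as u16;
    for (src, dst) in rgb.chunks_exact(3).zip(rgba.chunks_exact_mut(4)).take(width) {
        dst[..3].copy_from_slice(src);
        dst[3] = alpha;
    }
}

/// Stand-in for a `*_to_rgb_u16_row` kernel.
fn rgb_kernel(src: &[[u16; 3]], rgb: &mut [u16]) {
    for (px, dst) in src.iter().zip(rgb.chunks_exact_mut(3)) {
        dst.copy_from_slice(px);
    }
}

/// Stand-in for a `*_to_rgba_u16_row` kernel.
fn rgba_kernel<const BITS: u32>(src: &[[u16; 3]], rgba: &mut [u16]) {
    let alpha = ((1u32 << BITS) - 1) as u16;
    for (px, dst) in src.iter().zip(rgba.chunks_exact_mut(4)) {
        dst[..3].copy_from_slice(px);
        dst[3] = alpha;
    }
}

/// Strategy A: when both outputs are requested, run the RGB kernel once and
/// fan out to RGBA; when only RGBA is requested, go direct.
pub fn process_row<const BITS: u32>(
    src: &[[u16; 3]],
    rgb_out: Option<&mut [u16]>,
    rgba_out: Option<&mut [u16]>,
) {
    match (rgb_out, rgba_out) {
        (Some(rgb), Some(rgba)) => {
            rgb_kernel(src, rgb);
            expand_rgb_u16_to_rgba_u16_row::<BITS>(rgb, rgba, src.len());
        }
        (None, Some(rgba)) => rgba_kernel::<BITS>(src, rgba),
        (Some(rgb), None) => rgb_kernel(src, rgb),
        (None, None) => {}
    }
}
```

The point of the combine is that the expensive conversion runs once per row regardless of how many outputs are attached; the fan-out is a cheap copy plus an alpha store.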

src/sinker/mixed/subsampled_4_2_2_high_bit.rs (2 formats): Yuv440p10 and Yuv440p12 were the explicit deferral from PR #28's "out of scope" note — they reuse the 4:4:4 dispatchers (yuv444p10/12_to_rgba(_u16)_row), which only became available with this PR. Now wired.

40 new builder methods total (4 × 10 formats); 10 process() restructures. All Strategy A helpers (expand_rgb_to_rgba_row, expand_rgb_u16_to_rgba_u16_row::<BITS>, rgba_plane_row_slice, rgba_u16_plane_row_slice) reused verbatim from PRs #20/#26.

Tests

Per-backend u16 RGBA equivalence (~30 tests): 6 per backend × 5 backends, mirroring PR #26's structure. Each backend covers all 4 kernel families across narrow + tail + 1920 widths, full ColorMatrix × range cross-product. All 18 new x86 #[test] functions include is_x86_feature_detected! early-return guards (per the PR #25 CI fallout: without them, ASAN sanitizer hits SIGILL and Miri reports UB on runners lacking the feature). NEON tests use #[cfg_attr(miri, ignore = "...")]. Wasm is module-level cfg-gated.
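The early-return guard mentioned above looks roughly like this. The kernel names and bodies are hypothetical placeholders (the "AVX2" function simply forwards to scalar); only the guard pattern itself reflects what the PR describes.

```rust
/// Scalar reference used as the expected result.
fn double_scalar(src: &[u16], dst: &mut [u16]) {
    for (s, d) in src.iter().zip(dst.iter_mut()) {
        *d = s.wrapping_mul(2);
    }
}

/// Placeholder for a real AVX2 kernel; forwards to scalar in this sketch.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn double_avx2(src: &[u16], dst: &mut [u16]) {
    double_scalar(src, dst)
}

#[cfg(all(test, target_arch = "x86_64"))]
mod x86_tests {
    use super::*;

    #[test]
    fn avx2_matches_scalar() {
        // Guard: skip on runners without AVX2 instead of executing
        // instructions the CPU cannot run.
        if !std::is_x86_feature_detected!("avx2") {
            return;
        }
        let src: Vec<u16> = (0..37).collect(); // odd width exercises a tail
        let mut want = vec![0u16; src.len()];
        let mut got = vec![0u16; src.len()];
        double_scalar(&src, &mut want);
        // SAFETY: AVX2 availability was checked by the guard above.
        unsafe { double_avx2(&src, &mut got) };
        assert_eq!(want, got);
    }
}
```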

Sinker tests (9): Representative coverage of Yuv444p10 (BITS-generic planar — both u8 + u16 + Strategy A combine + buffer-too-short err), P410 (BITS-generic Pn semi-planar), Yuv444p16 (16-bit dedicated kernel), and Yuv440p10 (proves the 4:4:0 → 4:4:4 kernel reuse path works end-to-end). Matches PR #26's coverage scope.

Doc-fail example update

The compile_fail doctest in src/sinker/mixed/planar_8bit.rs previously demonstrated the type-system rejection by attempting with_rgba on Yuv444p10. After this PR, every YUV format in the inventory writes RGBA, so the example now points at Bayer (RAW source, no inherent alpha plane — genuinely lacks with_rgba).
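One way such a type-system rejection can be arranged is a marker trait that gates the builder. This is only an illustrative sketch: the `RgbaOutput` trait, the field layout, and the method signatures are assumptions, and the crate may achieve the same effect differently (for example with per-format impl blocks).

```rust
use std::marker::PhantomData;

/// Format markers (names from the PR; definitions here are hypothetical).
pub struct Yuv444p10;
pub struct Bayer;

/// Hypothetical marker trait: implemented only for formats with RGBA output.
pub trait RgbaOutput {}
impl RgbaOutput for Yuv444p10 {}

pub struct MixedSinker<'a, F> {
    rgba: Option<&'a mut [u8]>,
    _format: PhantomData<F>,
}

impl<'a, F> MixedSinker<'a, F> {
    pub fn new() -> Self {
        Self { rgba: None, _format: PhantomData }
    }

    pub fn has_rgba(&self) -> bool {
        self.rgba.is_some()
    }
}

/// `with_rgba` exists only when `F: RgbaOutput`, so
/// `MixedSinker::<Bayer>::new().with_rgba(..)` is rejected at compile time.
impl<'a, F: RgbaOutput> MixedSinker<'a, F> {
    pub fn with_rgba(mut self, buf: &'a mut [u8]) -> Self {
        self.rgba = Some(buf);
        self
    }
}
```

With this arrangement a compile_fail doctest needs a format that lacks the builder, which is why the negative example has to move to a format such as Bayer once every YUV format gains it.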

Test plan

  • cargo test --lib: 534 pass on aarch64-darwin (host); was 519 → +6 NEON-side u16 RGBA + 9 sinker tests
  • cargo check --tests --lib clean across host, x86_64-unknown-freebsd, wasm32-unknown-unknown
  • RUSTFLAGS="-Dwarnings" cargo clippy --lib --tests clean on host
  • cargo test --doc passes (the new Bayer compile_fail example correctly fails to compile)
  • Zero dead_code warnings — every new *_to_rgba_u16_row wrapper is consumed by its dispatcher; every dispatcher is consumed by a sinker or remains available for direct row callers

Codex adversarial review

Verdict: not run — Codex hit its OpenAI usage rate limit (9:08 PM retry window). The structural pattern is identical to PR #26 (4:2:0) which Codex approved. Re-run available on request once the rate limit clears.

Closes Ship 8 high-bit 4:4:4 (Tranche 7)

After this PR, Ship 8's Tranche 7 row in CHANGELOG.md flips to ✅ shipped.

The remaining Ship 8 work item is Ship 8b (source-side YUVA — separate follow-up; out of scope for Ship 8).

🤖 Generated with Claude Code

@al8n al8n changed the title from "update" to "Ship 8 Tranche 7c: high-bit 4:4:4 RGBA u16 SIMD + sinker integration" Apr 27, 2026
@al8n al8n requested a review from Copilot April 27, 2026 05:37

Copilot AI left a comment


Pull request overview

Completes Ship 8 Tranche 7c by wiring high-bit 4:4:4 native-depth u16 RGBA SIMD across all supported backends, exposing the u16 RGBA row dispatchers, and integrating RGBA/RGBA-u16 output buffers into the relevant MixedSinker high-bit formats (including the previously-deferred Yuv440p10/12 reuse path).

Changes:

  • Added u16 RGBA SIMD wrappers/templates across NEON, SSE4.1, AVX2, AVX-512, and wasm simd128 backends (plus per-backend scalar equivalence tests).
  • Wired the 8 high-bit 4:4:4 u16 RGBA public row dispatchers in src/row/mod.rs to the per-arch SIMD backends (with scalar fallback).
  • Added sinker-level RGBA/RGBA-u16 integration tests and updated the compile-fail doctest negative example to use raw::Bayer.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| src/row/mod.rs | Wires high-bit 4:4:4 u16 RGBA dispatchers to SIMD backends with scalar fallback. |
| src/row/arch/neon.rs | Adds const-ALPHA u16 RGBA wrappers and shared impls for NEON 4:4:4 kernels. |
| src/row/arch/neon/tests.rs | Adds NEON u16 RGBA equivalence tests vs scalar reference. |
| src/row/arch/x86_sse41.rs | Refactors u16 RGB kernels into shared RGB/RGBA templates and adds u16 RGBA wrappers. |
| src/row/arch/x86_sse41/tests.rs | Adds SSE4.1 u16 RGBA equivalence tests with runtime feature detection. |
| src/row/arch/x86_avx2.rs | Refactors AVX2 u16 RGB kernels into shared RGB/RGBA templates and adds u16 RGBA wrappers. |
| src/row/arch/x86_avx2/tests.rs | Adds AVX2 u16 RGBA equivalence tests with runtime feature detection. |
| src/row/arch/x86_avx512.rs | Refactors AVX-512 u16 RGB kernels into shared RGB/RGBA templates and adds u16 RGBA wrappers. |
| src/row/arch/x86_avx512/tests.rs | Adds AVX-512 u16 RGBA equivalence tests with runtime feature detection. |
| src/row/arch/wasm_simd128.rs | Refactors wasm simd128 u16 RGB kernels into shared RGB/RGBA templates and adds u16 RGBA wrappers. |
| src/row/arch/wasm_simd128/tests.rs | Adds wasm simd128 u16 RGBA equivalence tests (cfg-gated on target_feature="simd128"). |
| src/sinker/mixed/subsampled_4_2_2_high_bit.rs | Adds with_rgba/with_rgba_u16 (and setters) for Yuv440p10/12 and integrates Strategy A fan-out where applicable. |
| src/sinker/mixed/planar_8bit.rs | Updates compile-fail doctest negative example to use MixedSinker<Bayer> since YUV formats now support RGBA. |
| src/sinker/mixed/tests.rs | Adds representative sinker integration tests for high-bit 4:4:4 RGBA/u16 RGBA and the Yuv440p10 reuse path. |


@al8n al8n force-pushed the feat/ship8-rgba-high-bit-444-u16-simd branch from 94002a7 to 0acb53c Compare April 27, 2026 06:29
@uqio uqio merged commit 272dd03 into main Apr 27, 2026
43 checks passed
@uqio uqio deleted the feat/ship8-rgba-high-bit-444-u16-simd branch April 27, 2026 08:12